INTERSPEECH.2008 - Language and Multimodal

Total: 140

#1 Coarticulation in nasal and lateral clusters in Warlpiri

Authors: Janet Fletcher ; Deborah Loakes ; Andrew Butcher

Indigenous Australian languages are said to show remarkable stability in C1C2 sequences, with no evidence of assimilation of place of articulation. An EPG corpus of Warlpiri was examined to test the extent of spatio-temporal modification in a series of nasal and lateral/oral stop clusters that differed in place of articulation. There was evidence of limited anticipatory coarticulation in nasal clusters. Laminal palatal sonorants also exerted the strongest carryover coarticulatory effects on the following consonant, although place contrasts were maintained, showing that the extent of coarticulation (both spatially and temporally) is somewhat constrained by the phonological structures of the language.

#2 Phonetically prestopped laterals in Australian languages: a preliminary investigation of Warlpiri

Authors: Deborah Loakes ; Andrew Butcher ; Janet Fletcher ; Hywel Stoakes

Phonologically prestopped nasals occur primarily in central and southern Australian languages. Phonetically prestopped nasals, on the other hand, occur in a large number of Australian languages and are not restricted to one particular region. Phonetically prestopped nasals have been analysed as a preservation of spectral characteristics at vowel-sonorant boundaries in languages which have a comparatively large number of sonorant contrasts. In this paper we describe the acoustic, articulatory and durational characteristics of rarely mentioned phonetically prestopped laterals in the Australian language Warlpiri. We conclude that, like prestopped nasals, prestopped laterals are likely to be the outcome of a coarticulatory avoidance strategy that preserves the left edge of the sonorant. While prestopped laterals are not auditorily salient, we report on their frequent occurrence and the very distinctive phonetic characteristics associated with them.

#3 Connected speech processes in Warlpiri

Authors: John Ingram ; Mary Laughren ; Jeff Chapman

Connected speech processes (CSP) in Warlpiri, an indigenous language of central Australia, taken from two fluent 'dreaming' monologues, are analyzed with the aim of observing the influence of language-particular phonological constraints and prosody upon phonetic processes of lenition, coarticulation and selective segmental enhancement.

#4 Consonant enhancement in Lamalama, an initial-dropping language of Cape York Peninsula, North Queensland

Author: Christina Pentland

In this paper I describe the consonant system of Lamalama and show that it is an extreme example of an initial-dropping language in which consonants are enhanced or strengthened in word-initial position. Initial-dropping languages have undergone a series of radical sound changes including, but not limited to, the loss of word-initial consonants. Lamalama has lost entire word-initial syllables, with the result that most roots are monosyllabic C(C)V(C) forms. In this reduced phonological system, various strategies are used to enhance word-initial consonants and maintain consonantal contrasts.

#5 Text, rhythm and metrical form in an Aboriginal song series

Author: Myfany Turpin

Setting words to (musical) rhythm is an attempt to match rhythmic positions and syllables in an aesthetically appealing manner. In English songs acceptability is based on two separate but interactive judgments: matching stress with metrically strong positions, and matching prosodic constituents with rhythmic constituents [1]. This paper investigates a genre of Aboriginal songs and finds that while prosodic and rhythmic constituents match, there is no requirement to match stress. Instead, the placement of syllables is conditioned by a caesura (word boundary rule) and a hierarchy whereby rhythmical units with fewer notes must not precede ones with more.

#6 Predicting ASR errors by exploiting barge-in rate of individual users for spoken dialogue systems

Authors: Kazunori Komatani ; Tatsuya Kawahara ; Hiroshi G. Okuno

We exploit the barge-in rate of individual users to predict automatic speech recognition (ASR) errors. A barge-in is a situation in which a user starts speaking during a system prompt, and it can be detected even when ASR results are not reliable. Such features, which do not rely on ASR results, can serve as a clue for managing situations in which user utterances cannot be successfully recognized. Since individual users in our system can be identified by their phone numbers, we accumulate how often each user barges in and use this rate as a user profile for determining whether a current "barge-in" utterance should be accepted or not. We furthermore set a window that reflects the temporal transition of the user's behavior as they get accustomed to the system. Experimental results show that setting the window improves the prediction accuracy of whether the utterance should be accepted or not. The experiments also clarify the minimum window width required to improve accuracy.
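
To make the idea concrete, here is a minimal Python sketch of a windowed per-user barge-in rate used as an acceptance feature; the class name, window size, and threshold are illustrative assumptions, not the authors' implementation.

```python
from collections import deque

class BargeInProfile:
    """Per-user barge-in rate over a sliding window of recent turns."""

    def __init__(self, window: int = 20):
        # The window caps how far back we look, so the rate tracks the
        # user's current behavior as they get accustomed to the system.
        self.history = deque(maxlen=window)  # 1 = barged in, 0 = did not

    def record(self, barged_in: bool) -> None:
        self.history.append(1 if barged_in else 0)

    def rate(self) -> float:
        return sum(self.history) / len(self.history) if self.history else 0.0

def accept_barge_in(profile: BargeInProfile, threshold: float = 0.3) -> bool:
    # Habitual barge-in users are more likely to produce a valid utterance;
    # the threshold value here is an assumption for illustration.
    return profile.rate() >= threshold

# Usage: one profile per caller, keyed by phone number, updated every turn.
profile = BargeInProfile(window=20)
for turn in [True, False, True, True]:
    profile.record(turn)
print(accept_barge_in(profile))  # True: rate 0.75 >= 0.3
```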

#7 Expanding vocabulary for recognizing user's abbreviations of proper nouns without increasing ASR error rates in spoken dialogue systems

Authors: Masaki Katsumaru ; Kazunori Komatani ; Tetsuya Ogata ; Hiroshi G. Okuno

Users often abbreviate long words when using spoken dialogue systems, which results in automatic speech recognition (ASR) errors. We define abbreviated words as sub-words of the original word and add them to the ASR dictionary. The first problem is that proper nouns cannot be correctly segmented by general morphological analyzers, although long and compounded words need to be segmented in agglutinative languages such as Japanese. The second is that adding many abbreviated words enlarges the vocabulary and degrades ASR accuracy. We develop two methods: (1) segmenting words by using conjunction probabilities between characters, and (2) manipulating the occurrence probabilities of generated abbreviated words on the basis of the phonological similarities between abbreviated and original words. With our methods, ASR accuracy is improved by 24.2 points for utterances containing abbreviated words and degraded by only 0.1 points for those containing original words.
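
As a rough illustration of step (2), the sketch below generates prefix abbreviations of a proper noun and weights each by a crude similarity to the original word; the prefix-only generation and the length-ratio stand-in for phonological similarity are simplifying assumptions, not the paper's method.

```python
def abbreviations(word: str, min_len: int = 2) -> list[str]:
    """Generate sub-word abbreviations; here, simply all proper prefixes."""
    return [word[:i] for i in range(min_len, len(word))]

def occurrence_probs(word: str, base_prob: float = 1e-4) -> dict[str, float]:
    """Assign each abbreviation an occurrence probability scaled by its
    similarity to the original word (a placeholder for the phonological
    similarity used in the paper)."""
    return {abbr: base_prob * len(abbr) / len(word)
            for abbr in abbreviations(word)}

print(occurrence_probs("yamatosaidaiji"))  # longer prefixes get more mass
```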

#8 Exploiting the ASR n-best by tracking multiple dialog state hypotheses

Author: Jason D. Williams

When the top ASR hypothesis is incorrect, often the correct hypothesis is listed as an alternative in the ASR N-Best list. Whereas traditional spoken dialog systems have struggled to exploit this information, this paper argues that a dialog model that tracks a distribution over multiple dialog states can improve dialog accuracy by making use of the entire N-Best list. The key element of the approach is a generative model of the N-Best list given the user's true hidden action. An evaluation on real dialog data verifies that dialog accuracy rates are improved by making use of the entire N-Best list.
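
A minimal sketch of the underlying belief update, assuming a discrete state space and a toy observation model (both invented here for illustration):

```python
def update_belief(belief, nbest, obs_model):
    """belief: {state: prob}. nbest: [(hypothesis, asr_score)].
    obs_model(hyp, state): likelihood of this N-best entry appearing
    when `state` reflects the user's true action."""
    new = {}
    for state, prior in belief.items():
        # Sum over the whole N-best list, not just the top hypothesis.
        likelihood = sum(score * obs_model(hyp, state) for hyp, score in nbest)
        new[state] = prior * likelihood
    z = sum(new.values())
    return {s: p / z for s, p in new.items()} if z > 0 else belief

# Toy usage with two candidate dialog states.
belief = {"boston": 0.5, "austin": 0.5}
nbest = [("boston", 0.6), ("austin", 0.4)]
obs = lambda hyp, state: 0.8 if hyp == state else 0.2
print(update_belief(belief, nbest, obs))  # boston ~0.56, austin ~0.44
```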

#9 A spoken language interpretation component for a robot dialogue system

Authors: Enes Makalic ; Ingrid Zukerman ; Michael Niemann

The DORIS project aims to develop a spoken dialogue module for an autonomous robotic agent. This paper examines the techniques used by Scusi?, the speech interpretation component of DORIS, to postulate and assess hypotheses regarding the meaning of a spoken utterance. The results of our evaluation are encouraging, yielding good interpretation performance for utterances of different types and lengths.

#10 MUESLI: multiple utterance error correction for a spoken language interface

Authors: Federico Cesari ; Horacio Franco ; Gregory K. Myers ; Harry Bratt

We propose a method for using all available information to help correct recognition errors in tasks that use constrained grammars of the kind used in the domain of Command and Control (CC) systems. In current spoken language CC systems, if there is a recognition error, the user repeats the same phrase multiple times until a correct recognition is achieved. This interaction can be frustrating for the user, especially at high levels of ambient noise. We aim to improve the accuracy of the error correction process by using all the previous information available at a given point, namely the previous utterances of the same input phrase and the knowledge that the previous result contained an error.
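
One way to picture the approach is the hypothetical re-ranker below, which pools scores across repeated attempts and drops hypotheses the user has already rejected; the function and scoring scheme are assumptions, not the MUESLI implementation.

```python
def rerank(attempts, rejected):
    """attempts: one n-best list [(phrase, score)] per repetition of the
    same input phrase; rejected: phrases already confirmed to be errors."""
    combined = {}
    for nbest in attempts:
        for phrase, score in nbest:
            if phrase in rejected:
                continue  # the previous result is known to be wrong
            combined[phrase] = combined.get(phrase, 0.0) + score
    return max(combined, key=combined.get)

nbest1 = [("open valve two", 0.6), ("open valve ten", 0.4)]
nbest2 = [("open valve ten", 0.5), ("open valve two", 0.5)]
print(rerank([nbest1, nbest2], rejected={"open valve two"}))  # open valve ten
```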

#11 Methods to optimize transcription of on-line media

Authors: Sarah Conrod ; Sara Basson ; Dimitri Kanevsky

This paper outlines the growing need for fast, low-cost methods of providing transcripts of audio and video media to people who are deaf and hard of hearing. Three different methods for creating such transcripts are outlined: traditional manual transcription and two automatic speech recognition (ASR) methods, namely a semi-automatic process called shadowing and a web-based automatic transcription tool created by IBM. A pilot study examining the three methods was conducted; its results are presented and discussed, along with potential future studies regarding the efficacy and usability of the outputs from the various methods.

#12 Discrimination of task-related words for vocabulary design of spoken dialog systems

Authors: Akinori Ito ; Toyomi Meguro ; Shozo Makino ; Motoyuki Suzuki

This paper describes a method used to determine whether a specific word is related to a certain spoken dialog task. In most ordinary spoken dialog systems, only the words that are actually used to achieve the task are included in the vocabulary. As a result, the system cannot recognize utterances that contain OOV words that are related to the task. We therefore developed a method for determining the words that are related to a specified task in order to augment the system's vocabulary. Our method is based on word similarity. We examined three similarity measures: word occurrence frequency on the Web, distance in a thesaurus, and word similarity using LSA. The experiments revealed that the thesaurus-based and LSA-based methods have an OOV problem. To solve this problem, we developed a way to combine these two methods with the Web-based method. In addition, we tried combining the methods using the AdaBoost algorithm.
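
The abstract does not spell out the combination scheme, so the sketch below shows one plausible reading: back off from the thesaurus to LSA and finally to the Web measure whenever a word is OOV. The function names and toy similarity functions are assumptions.

```python
def task_relatedness(word, task_words, thesaurus_sim, lsa_sim, web_sim):
    """Score how related `word` is to a task, backing off across the three
    similarity measures when one returns None (OOV)."""
    scores = []
    for t in task_words:
        s = thesaurus_sim(word, t)
        if s is None:          # OOV in the thesaurus
            s = lsa_sim(word, t)
        if s is None:          # OOV in the LSA space too
            s = web_sim(word, t)
        scores.append(s)
    return max(scores)

# Toy usage: "fare" is OOV for the thesaurus, so LSA supplies the score.
thesaurus = lambda w, t: None if w == "fare" else 0.4
lsa = lambda w, t: 0.7
web = lambda w, t: 0.5
print(task_relatedness("fare", ["ticket", "bus"], thesaurus, lsa, web))  # 0.7
```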

#13 Dialog management using weighted finite-state transducers

Authors: Chiori Hori ; Kiyonori Ohtake ; Teruhisa Misu ; Hideki Kashioka ; Satoshi Nakamura

We are aiming to construct an expandable and adaptable dialog system which handles multiple tasks and senses users' intentions via multiple modalities. A flexible platform that integrates different dialog strategies and modalities is indispensable for this purpose. In this paper, we propose an efficient approach to dialog management using a weighted finite-state transducer (WFST), in which users' concept tags and the system's action tags are the input and output of the transducer, respectively. By incorporating WFSTs in dialog management, different components can easily be integrated and work on a common platform. We have constructed a prototype spoken dialog system for a Kyoto tour guide, which assists users in interactively planning a one-day sightseeing trip. A WFST for dialog management was created based on the annotated transcript of the Kyoto tour guide dialog corpus we recorded. The WFST was then composed with a word-to-concept WFST for language understanding, and optimized. We have confirmed that our WFST-based dialog manager handles recognition results from a speech recognizer well and works as designed.
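
A toy rendition of the transducer view, where user concept tags are inputs, system action tags are outputs, and weights rank competing arcs; the states, tags, and weights below are invented for illustration, and a real system would use a WFST toolkit with composition and optimization.

```python
# (state, input concept) -> [(next_state, output action, weight)]
transitions = {
    ("start", "request:spot"): [("ask_area", "confirm:spot", 0.9)],
    ("ask_area", "inform:area"): [("plan", "present:route", 0.8)],
}

def step(state, concept):
    arcs = transitions.get((state, concept), [])
    if not arcs:
        return state, "clarify:repeat", 0.0  # fallback action, stay put
    return max(arcs, key=lambda arc: arc[2])  # best-weighted arc wins

state = "start"
for concept in ["request:spot", "inform:area"]:
    state, action, weight = step(state, concept)
    print(concept, "->", action)
```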

#14 Probabilistic answer selection based on conditional random fields for spoken dialog system

Authors: Yoshitaka Yoshimi ; Ryota Kakitsuba ; Yoshihiko Nankaku ; Akinobu Lee ; Keiichi Tokuda

A probabilistic answer selection method for a spoken dialog system based on Conditional Random Fields (CRFs) is described. The probabilities of answers for a question are trained by CRFs based on the lexical and morphological properties of each word, and the most likely answer given the recognized word sequence of the question utterance is chosen as the system output. Various sets of feature functions were evaluated on real data from a speech-oriented information kiosk system, and it is shown that the morphological properties have positive effects on response accuracy. Training with recognizer output of the training database instead of manual transcriptions was also investigated. It was also shown that the proposed scheme can achieve higher accuracy than conventional keyword-based answer selection.
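
At selection time, a CRF over answers acts as a log-linear scorer; the following sketch, with a single invented lexical-overlap feature, shows that shape (it is not the authors' feature set).

```python
import math

def select_answer(question_words, answers, weights, features):
    """Score candidates with a log-linear model and return the most
    probable answer with its softmax probability."""
    def score(ans):
        return sum(w * features[name](question_words, ans)
                   for name, w in weights.items())
    scores = [score(a) for a in answers]
    z = sum(math.exp(s) for s in scores)
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best], math.exp(scores[best]) / z

features = {"overlap": lambda q, a: len(set(q) & set(a.split()))}
weights = {"overlap": 1.0}
print(select_answer(["library", "hours"],
                    ["the library opens at nine", "turn left"],
                    weights, features))  # ('the library opens at nine', ~0.73)
```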

#15 Let's go lab: a platform for evaluation of spoken dialog systems with real world users

Authors: Maxine Eskenazi ; Alan W. Black ; Antoine Raux ; Brian Langner

This short paper is intended to advertise Let's Go Lab, a platform for the evaluation of spoken dialog research. Unlike other dialog platforms, in addition to example dialog data and a portable software system, Let's Go Lab affords evaluation with real users. Let's Go has served the Pittsburgh public with bus schedule information since 2005, answering more than 52,000 calls to date.

#16 The impact of language dynamics on the capitalization of broadcast news

Authors: Fernando Batista ; Nuno Mamede ; Isabel Trancoso

This paper investigates the impact of language dynamics on the capitalization of transcriptions of broadcast news. Most of the capitalization information is provided by a large newspaper corpus. Three different speech corpora subsets, from different time periods, are used for evaluation, assessing the importance of having training data from nearby time periods. Results are provided both for manual and automatic transcriptions, also showing the impact of recognition errors on the capitalization task. Our approach is based on maximum entropy models, uses an unlimited vocabulary, and is suitable for language adaptation. The language model for a given time period is produced by retraining a previous language model with data from that period. The language model produced with this approach can be sorted and then pruned, in order to reduce computational resources, without much impact on the final results.
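
As a small illustration of the pruning step mentioned above, the sketch below drops low-weight features from a maximum entropy model; the threshold and feature names are invented.

```python
def prune(feature_weights: dict, threshold: float = 1e-3) -> dict:
    """Keep only features whose absolute weight clears the threshold,
    shrinking the capitalization model at little cost in accuracy."""
    return {f: w for f, w in feature_weights.items() if abs(w) >= threshold}

model = {"prev=mr.": 2.1, "word=bank": 0.0004, "suffix=-ton": 0.8}
print(prune(model))  # {'prev=mr.': 2.1, 'suffix=-ton': 0.8}
```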

#17 Lightly supervised acoustic model training on EPPS recordings

Authors: Matthias Paulik ; Alex Waibel

Debates in the European Parliament are simultaneously translated into the official languages of the Union. These interpretations are broadcast live via satellite on separate audio channels. After several months, the parliamentary proceedings are published as final text editions (FTE). FTEs are formatted for easy readability and can differ significantly from the original speeches and the live broadcast interpretations. We examine the impact on German word error rate (WER) of introducing supervision based on German FTEs and supervision based on German automatic translations extracted from the English and Spanish audio. We show that FTE-based supervision and additional interpretation-based supervision provide significant reductions in WER. We successfully apply FTE-supervised acoustic model (AM) training using 143 hours of recordings. Combining the new AM with the mentioned supervision techniques, we achieve a significant relative WER reduction of 13.3%.

#18 Fast call-classification system development without in-domain training data

Authors: Christophe Servan ; Frédéric Bechet

This paper presents a new method for the fast development of call-routing systems based on pre-existing corpora and knowledge databases. This method further reduces the task-specific data collection and annotation needed to develop a new call-classification system: no task-specific data collection is needed to train either the Automatic Speech Recognition (ASR) models or the classification models. The main idea is to re-use existing data to train the models, according to a priori knowledge of the targeted task. The experimental framework used in this study is a call-routing system applied to a civil service information telephone application. All the a priori knowledge used to develop the system is extracted from the civil service information website as well as from pre-existing corpora. The evaluation of our strategy was made on a test corpus containing 216 utterances recorded by 10 different speakers.

#19 iCNC and iROVER: the limits of improving system combination with classification?

Authors: Björn Hoffmeister ; Ralf Schlüter ; Hermann Ney

We show how ROVER and confusion network combination (CNC) can be improved with classification. The general idea of improving combination with classification is that each word is assigned to a certain location and at each location a classifier decides which of the provided alternatives is most likely correct. We investigate four variations of this idea and three different classifiers, which are trained on various features derived from ASR lattices. For our experiments, we use highly optimized ROVER and CNC systems as baselines, which already give a relative reduction in WER of more than 20% on the TC-Star 2007 English task. With our methods we can further improve the results of the corresponding standard combination methods.
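
The voting step being replaced can be pictured as follows: at each aligned slot, plain ROVER takes a majority vote, while the proposed variants hand the decision to a trained classifier. The classify_fn hook below stands in for that classifier and is an assumption.

```python
from collections import Counter

def rover_vote(alternatives, classify_fn=None):
    """alternatives: the word each system proposes at one aligned slot
    (None = deletion). Majority vote by default; a classifier, when
    provided, picks among the alternatives instead."""
    if classify_fn is not None:
        return classify_fn(alternatives)
    counts = Counter(w for w in alternatives if w is not None)
    return counts.most_common(1)[0][0] if counts else None

print(rover_vote(["the", "the", "a"]))  # 'the' by majority vote
```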

#20 System combination for spoken language understanding

Authors: Stefan Hahn ; Patrick Lehnen ; Hermann Ney

One of the first steps in an SLU system usually is the extraction of flat concepts. Within this paper, we present five methods for concept tagging and give experimental results on the state-of-the-art MEDIA corpus for both manual transcriptions (REF) and ASR input (ASR). Compared to previous publications, some single systems could be improved, and the ASR results are presented for the first time. We could improve the tagging performance of the best known result on this task by approx. 7% relative, from 16.2% to 15.0% CER, for REF using light-weight system combination (ROVER). For the ASR task, we achieve improvements of approx. 3% relative, from 29.8% to 28.9% CER. An analysis of the differences in performance on both tasks is also given.

#21 Question and answer database optimization using speech recognition results

Authors: Shota Takeuchi ; Tobias Cincarek ; Hiromichi Kawanami ; Hiroshi Saruwatari ; Kiyohiro Shikano

The aim of this research is a human-oriented spoken dialog system which provides replies to a wide variety of user utterances. The example-based response generation method searches a question and answer database (QADB) for the example question most similar to a user utterance. With this method, the system can answer questions that are difficult to express with an explicit model. A QADB is constructed from question and answer pairs (QA pairs) by employing a large corpus. In order to enhance robustness to recognition errors on inarticulate utterances, such as children's utterances, we propose to use speech recognition results, instead of manual transcriptions, as example questions. We also introduce an optimization method that removes inappropriate QA pairs from a QADB to maximize response accuracy. We show that our method improves response accuracy, especially for children's utterances, in the open test.

#22 Development and evaluation of hands-free spoken dialogue system for railway station guidance

Authors: Hiroshi Saruwatari ; Yu Takahashi ; Hiroyuki Sakai ; Shota Takeuchi ; Tobias Cincarek ; Hiromichi Kawanami ; Kiyohiro Shikano

In this paper, we describe the development and evaluation of a hands-free spoken dialogue system used for railway station guidance. In this application, noise robustness is the most essential issue for the dialogue system. To address the problem, we introduce two key techniques in our proposed hands-free system: (a) a blind spatial subtraction array (BSSA) as a preprocessing step, which can efficiently reduce nonstationary and diffuse noises in real time, and (b) robust voice activity detection (VAD) based on speech decoding for further improvement of speech recognition accuracy. The experimental assessment of the proposed dialogue system reveals that the combination of real-time BSSA and robust VAD can provide recognition accuracy of more than 80% under adverse railway-station noise conditions.

#23 Statistical shared plan-based dialog management

Authors: Amanda J. Stent ; Srinivas Bangalore

In this paper we describe a statistical shared plan-based approach to dialog modeling and dialog management. We apply this approach to a corpus of human-human spoken dialogs. We compare the performance of models trained on transcribed and automatically recognized speech, and present ideas for further improving the models.

#24 When calls go wrong: how to detect problematic calls based on log-files and emotions?

Authors: Ota Herm ; Alexander Schmitt ; Jackson Liscombe

Traditionally, the prediction of problematic calls in Interactive Voice Response systems in call centers has been based either on dialog state transitions and recognition log data, or on caller emotion. We present a combined model incorporating both types of feature sets that achieved 79.22% classification accuracy for problematic and non-problematic calls after only the first four turns of a human-computer dialogue. We found that using acoustic features to indicate caller emotion did not yield any significant increase in accuracy.

#25 Unsupervised learning of edit parameters for matching name variants

Authors: Dan Gillick ; Dilek Hakkani-Tür ; Michael Levit

Since named entities are often written in different ways, question answering (QA) and other language processing tasks stand to benefit from entity matching. We address the problem of finding equivalent person names in unstructured text. Our approach is a generalization of spelling correction: we compare an input name to candidate matches by applying a set of edits to the input name. We introduce a novel unsupervised method for learning spelling edit probabilities which improves overall F-Measure on our own name-matching task by 12%. Relevance is demonstrated by application to the GALE Distillation task.
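
Learned edit probabilities plug naturally into a weighted edit distance; the sketch below shows the matching side, a Viterbi-style search for the best edit sequence, with a uniform toy cost model standing in for the learned probabilities.

```python
import math

def edit_logprob(a: str, b: str, edit_logp) -> float:
    """Best log-probability of turning name a into name b, given
    edit_logp(op, x, y) for op in {"sub", "ins", "del"}."""
    n, m = len(a), len(b)
    d = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if d[i][j] == -math.inf:
                continue
            if i < n and j < m:  # substitution (or match when a[i] == b[j])
                d[i + 1][j + 1] = max(d[i + 1][j + 1],
                                      d[i][j] + edit_logp("sub", a[i], b[j]))
            if j < m:            # insertion of b[j]
                d[i][j + 1] = max(d[i][j + 1],
                                  d[i][j] + edit_logp("ins", "", b[j]))
            if i < n:            # deletion of a[i]
                d[i + 1][j] = max(d[i + 1][j],
                                  d[i][j] + edit_logp("del", a[i], ""))
    return d[n][m]

# Toy cost model: exact matches are free, every other edit costs log(0.1).
toy = lambda op, x, y: 0.0 if (op == "sub" and x == y) else math.log(0.1)
print(edit_logprob("jon", "john", toy))  # one insertion: log(0.1) ~ -2.3
```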